Abstract. In this paper Schapire and Singer's AdaBoost.MH boosting
algorithm is applied to the Word Sense Disambiguation (WSD) problem.
Initial experiments on a set of 15 selected polysemous words show that
the boosting approach surpasses Naive Bayes and Exemplar-based
approaches, which represent state-of-the-art accuracy on supervised WSD.
In order to make boosting practical for a real learning domain of
thousands of words, several ways of accelerating the algorithm by reducing
the feature space are studied. The best variant, which we call LazyBoosting,
is tested on the largest sense-tagged corpus available, containing 192,800
examples of the 191 most frequent and ambiguous English words. Again,
boosting compares favourably to the other benchmark algorithms.



1 Introduction 

Word Sense Disambiguation (WSD) is the problem of assigning the appropriate
meaning (sense) to a given word in a text or discourse. This meaning is
distinguishable from other senses potentially attributable to that word.
Resolving the ambiguity of words is a central problem for language
understanding applications and their associated tasks, including, for
instance, machine translation, information retrieval and hypertext
navigation, parsing, spelling correction, reference resolution, automatic
text summarization, etc.

WSD is one of the most important open problems in the Natural Language 
Processing (NLP) field. Despite the wide range of approaches investigated and 
the large effort devoted to tackling this problem, it is a fact that to date no large- 
scale, broad coverage and highly accurate word sense disambiguation system has 
been built. 

The most successful current line of research is the corpus-based approach,
in which statistical or Machine Learning (ML) algorithms are applied to
learn statistical models or classifiers from corpora in order to perform WSD.
Generally, supervised approaches (those that learn from a previously
semantically annotated corpus) have obtained better results than unsupervised
methods on small sets of selected highly ambiguous words, or artificial
pseudo-words. Many standard ML algorithms for supervised learning have been
applied, such as Naive Bayes, Exemplar-based learning, Decision Lists,
Neural Networks, etc. Further, Mooney has also compared all previously cited
methods on a very restricted domain, including Decision Trees and Rule
Induction algorithms as well. Unfortunately, there have been very few direct
comparisons of alternative methods on identical test data. However, it is
commonly accepted that Naive Bayes, Neural Networks and Exemplar-based
learning represent state-of-the-art accuracy on supervised WSD.

project and CIRIT's grant 1999FI 00773).

Supervised methods suffer from the lack of widely available semantically
tagged corpora from which to construct really broad coverage systems. This
is known as the "knowledge acquisition bottleneck". Ng estimates that the
manual annotation effort necessary to build a broad coverage semantically
annotated corpus would be about 16 man-years. This extremely high overhead
for supervision, together with the also serious overhead for learning and
testing many of the commonly used algorithms when scaling to real size WSD
problems, explains why supervised methods have been seriously questioned.

Due to this fact, recent works have focused on reducing the acquisition cost
as well as the need for supervision in corpus-based methods for WSD.
Consequently, the following three lines of research can be found: 1) the
design of efficient example sampling methods; 2) the use of lexical
resources, such as WordNet, and WWW search engines to automatically obtain
arbitrarily large samples of word senses from the Internet; 3) the use of
unsupervised EM-like algorithms for estimating the statistical model
parameters. It is also our belief that this body of work, and in particular
the second line, provides enough evidence towards the "opening" of the
acquisition bottleneck in the near future. For that reason, it is worth
further investigating the application of new supervised ML methods to better
resolve the WSD problem.

Boosting Algorithms. The main idea of boosting algorithms is to combine 
many simple and moderately accurate hypotheses (called weak classifiers) into 
a single, highly accurate classifier for the task at hand. The weak classifiers are 
trained sequentially and, conceptually, each of them is trained on the examples 
which were most difficult to classify by the preceding weak classifiers. 

The AdaBoost.MH algorithm applied in this paper is a generalization
of Freund and Schapire's AdaBoost algorithm, which has been studied
extensively (both theoretically and experimentally) and which has been shown
to perform well on standard machine-learning tasks, using standard
machine-learning algorithms as weak learners.

Regarding Natural Language (NL) problems, AdaBoost.MH has been successfully
applied to Part-of-Speech (PoS) tagging, Prepositional-Phrase-attachment
disambiguation, and Text Categorization, with especially good results.

The Text Categorization domain shares several properties with the usual
settings of WSD, such as: very high dimensionality (typical features consist
in testing the presence/absence of concrete words), presence of many
irrelevant and highly dependent features, and the fact that both the learned
concepts and the examples reside very sparsely in the feature space.
Therefore, the application of AdaBoost.MH to WSD seems to be a promising
choice. It has to be noted that, apart from the excellent results obtained on
NL problems, AdaBoost.MH has the advantages of being theoretically well
founded and easy to implement.

The paper is organized as follows: Section 2 is devoted to explaining in
detail the AdaBoost.MH algorithm. Section 3 describes the domain of
application and the initial experiments performed on a reduced set of words.
In Section 4 several alternatives are explored for accelerating the learning
process by reducing the feature space. The best alternative is fully tested
in Section 5. Finally, Section 6 concludes and outlines some directions for
future work.



2 The Boosting Algorithm AdaBoost.MH 

This section describes Schapire and Singer's AdaBoost.MH algorithm for
multiclass multi-label classification, using exactly the same notation given
by the authors.

As already said, the purpose of boosting is to find a highly accurate classifi- 
cation rule by combining many weak hypotheses (or weak rules), each of which 
may be only moderately accurate. It is assumed that there exists a separate pro- 
cedure called the WeakLearner for acquiring the weak hypotheses. The boosting 
algorithm finds a set of weak hypotheses by calling the weak learner repeatedly 
in a series of T rounds. These weak hypotheses are then combined into a single 
rule called the combined hypothesis. 

Let $S = \{(x_1, Y_1), \ldots, (x_m, Y_m)\}$ be the set of $m$ training
examples, where each instance $x_i$ belongs to an instance space
$\mathcal{X}$ and each $Y_i$ is a subset of a finite set of labels or classes
$\mathcal{Y}$. The size of $\mathcal{Y}$ is denoted by $k = |\mathcal{Y}|$.

The pseudo-code of AdaBoost.MH is presented in figure 1. AdaBoost.MH
maintains an $m \times k$ matrix of weights as a distribution $D$ over
examples and labels. The goal of the WeakLearner algorithm is to find a weak
hypothesis with moderately low error with respect to these weights.
Initially, the distribution $D_1$ is uniform, but the boosting algorithm
updates the weights on each round to force the weak learner to concentrate on
the (example, label) pairs which are hardest to predict.

More precisely, let $D_t$ be the distribution at round $t$, and
$h_t : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ the weak rule acquired
according to $D_t$. The sign of $h_t(x, l)$ is interpreted as a prediction of
whether label $l$ should be assigned to example $x$ or not. The magnitude of
the prediction $|h_t(x, l)|$ is interpreted as a measure of confidence in the
prediction. In order to understand the updating formula correctly, one last
piece of notation is needed: given $Y \subseteq \mathcal{Y}$ and
$l \in \mathcal{Y}$, let $Y[l]$ be $+1$ if $l \in Y$ and $-1$ otherwise.

Now, it becomes clear that the updating function decreases (or increases)
the weights $D_t(i, l)$ for which $h_t$ makes a good (or bad) prediction, and
that the size of this variation grows with $|h_t(x_i, l)|$.



procedure AdaBoost.MH (in: S = {(x_i, Y_i)}, i = 1..m)
    ### S is the set of training examples

    ### Initialize distribution D_1 (for all i, 1 <= i <= m, and all l, 1 <= l <= k)
    D_1(i,l) = 1/(mk)
    for t := 1 to T do
        ### Get the weak hypothesis h_t : X x Y -> R
        h_t = WeakLearner(X, D_t);

        ### Update distribution D_t (for all i, 1 <= i <= m, and all l, 1 <= l <= k)
        D_{t+1}(i,l) = D_t(i,l) exp(-Y_i[l] h_t(x_i,l)) / Z_t

        ### Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution)
    end-for
    return the combined hypothesis: f(x,l) = sum_{t=1..T} h_t(x,l)
end AdaBoost.MH



Fig. 1. The AdaBoost.MH algorithm 
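
The loop of figure 1 can be sketched in Python as follows. This is only a minimal illustration, not the implementation used in the experiments: the names `adaboost_mh`, `predict_sense` and `toy_weak_learner`, the representation of examples as sets of active features, and the two-example toy data are all assumptions made for the sketch.

```python
import math


def adaboost_mh(examples, labels, all_labels, weak_learner, rounds):
    """Run `rounds` boosting rounds and return the combined hypothesis f(x, l).

    examples: list of instances x_i; labels: list of label sets Y_i.
    weak_learner(examples, labels, all_labels, D) must return a function
    h(x, l) -> float (sign = prediction, magnitude = confidence).
    """
    m, k = len(examples), len(all_labels)
    # D_1 is uniform over all (example, label) pairs.
    D = [{l: 1.0 / (m * k) for l in all_labels} for _ in range(m)]
    hypotheses = []
    for _ in range(rounds):
        h = weak_learner(examples, labels, all_labels, D)
        hypotheses.append(h)
        # A pair's weight shrinks when sign(h) agrees with Y_i[l], grows otherwise.
        for i, x in enumerate(examples):
            for l in all_labels:
                y = 1.0 if l in labels[i] else -1.0
                D[i][l] *= math.exp(-y * h(x, l))
        z = sum(w for row in D for w in row.values())  # normalization factor Z_t
        for row in D:
            for l in all_labels:
                row[l] /= z

    def combined(x, l):
        return sum(h(x, l) for h in hypotheses)

    return combined


def predict_sense(combined, x, all_labels):
    # Single-label use for WSD: output the label that maximizes f(x, l).
    return max(all_labels, key=lambda l: combined(x, l))


def toy_weak_learner(examples, labels, all_labels, D):
    # Hypothetical fixed rule for the demo; it ignores D entirely.
    def h(x, l):
        return 1.0 if (l == "bank/finance") == ("money" in x) else -1.0
    return h


senses = ["bank/finance", "bank/river"]
train_x = [{"money"}, {"water"}]
train_y = [{"bank/finance"}, {"bank/river"}]
f = adaboost_mh(train_x, train_y, senses, toy_weak_learner, rounds=3)
```

Passing the weak learner as a parameter keeps the boosting loop independent of the form of the weak hypotheses.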



Note that WSD is not a multi-label classification problem since a unique sense
is expected for each word in context. In our implementation, the algorithm runs
exactly in the same way as explained above, except that the sets $Y_i$ are
reduced to a unique label, and that the combined hypothesis is forced to output
a unique label, which is the one that maximizes $f(x, l)$.

It only remains to define the form of the WeakLearner. Schapire and Singer
prove that the Hamming loss of the AdaBoost.MH algorithm on the training
set^1 is at most $\prod_{t=1}^{T} Z_t$, where $Z_t$ is the normalization
factor computed on round $t$. This upper bound is used in guiding the design
of the WeakLearner algorithm, which attempts to find a weak hypothesis $h_t$
that minimizes:

$$Z_t = \sum_{i=1}^{m} \sum_{l \in \mathcal{Y}} D_t(i, l) \exp(-Y_i[l] h_t(x_i, l)).$$



2.1 Weak Hypotheses for WSD 

Following previous work, very simple weak hypotheses are used, which test the
value of a boolean predicate and make a prediction based on that value. The
predicates used, which are described in section 3.1, are of the form
"f = v", where f is a feature and v is a value
(e.g. "previous_word = hospital"). Formally, based on a given predicate $p$,
our interest lies in weak hypotheses $h$ which make predictions of the form:

$$h(x, l) = \begin{cases} c_{1l} & \text{if } p \text{ holds in } x \\ c_{0l} & \text{otherwise} \end{cases}$$

where the $c_{jl}$'s are real numbers.



^1 i.e. the fraction of training examples $i$ and labels $l$ for which the
sign of $f(x_i, l)$ differs from $Y_i[l]$.



For a given predicate $p$, and bearing the minimization of $Z_t$ in mind, the
values $c_{jl}$ should be calculated as follows. Let $X_1$ be the subset of
examples for which the predicate $p$ holds and let $X_0$ be the subset of
examples for which the predicate $p$ does not hold. Let $[\![\pi]\!]$, for any
predicate $\pi$, be 1 if $\pi$ holds and 0 otherwise. Given the current
distribution $D_t$, the following real numbers are calculated for each
possible label $l$, for $j \in \{0, 1\}$, and for $b \in \{+1, -1\}$:

$$W_b^{jl} = \sum_{i=1}^{m} D_t(i, l) [\![ x_i \in X_j \wedge Y_i[l] = b ]\!]$$

That is, $W_{+1}^{jl}$ ($W_{-1}^{jl}$) is the weight (with respect to
distribution $D_t$) of the training examples in partition $X_j$ which are (or
are not) labelled by $l$.

As shown by Schapire and Singer, $Z_t$ is minimized for a particular
predicate by choosing:

$$c_{jl} = \frac{1}{2} \ln \left( \frac{W_{+1}^{jl}}{W_{-1}^{jl}} \right)$$

These settings imply that:

$$Z_t = 2 \sum_{j \in \{0, 1\}} \sum_{l \in \mathcal{Y}} \sqrt{W_{+1}^{jl} W_{-1}^{jl}}$$

Thus, the predicate $p$ chosen is that for which the value of $Z_t$ is
smallest.

Very small or zero values for the parameters $W_b^{jl}$ cause the $c_{jl}$
predictions to be large or infinite in magnitude. In practice, such large
predictions may cause numerical problems for the algorithm, and seem to
increase the tendency to overfit. As suggested by Schapire and Singer,
smoothed values for $c_{jl}$ have been used.
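
A hedged sketch of this weak learner is given below. For every candidate predicate it accumulates the weights $W_b^{jl}$, evaluates $Z_t = 2 \sum_{j,l} \sqrt{W_{+1}^{jl} W_{-1}^{jl}}$, and keeps the predicate with the smallest value, setting $c_{jl} = \frac{1}{2} \ln(W_{+1}^{jl} / W_{-1}^{jl})$ with a small constant added to numerator and denominator. The function name `best_stump`, the data layout, and the smoothing constant `eps = 1/(mk)` are assumptions of the example, not details taken from the experiments.

```python
import math


def best_stump(examples, labels, all_labels, D, candidate_features, eps=None):
    """Return (feature, c, z) for the predicate 'feature in x' with smallest Z_t.

    examples: list of sets of active binary features; labels: list of sets Y_i;
    D: list of dicts, D[i][l] being the weight of pair (i, l).
    c[(j, l)] is the prediction c_jl (j = 1 if the predicate holds, else 0).
    """
    m, k = len(examples), len(all_labels)
    if eps is None:
        eps = 1.0 / (m * k)  # assumed smoothing constant
    best = None
    for f in candidate_features:
        # W[(j, l, b)]: weight of the examples in X_j whose Y_i[l] equals b.
        W = {(j, l, b): 0.0
             for j in (0, 1) for l in all_labels for b in (+1, -1)}
        for i, x in enumerate(examples):
            j = 1 if f in x else 0
            for l in all_labels:
                b = +1 if l in labels[i] else -1
                W[(j, l, b)] += D[i][l]
        z = 2.0 * sum(math.sqrt(W[(j, l, +1)] * W[(j, l, -1)])
                      for j in (0, 1) for l in all_labels)
        if best is None or z < best[2]:
            # Smoothed c_jl: eps keeps the logarithm finite when a weight is 0.
            c = {(j, l): 0.5 * math.log((W[(j, l, +1)] + eps)
                                        / (W[(j, l, -1)] + eps))
                 for j in (0, 1) for l in all_labels}
            best = (f, c, z)
    return best


senses = ["fin", "riv"]
xs = [{"money", "the"}, {"water", "the"}]
ys = [{"fin"}, {"riv"}]
uniform = [{l: 0.25 for l in senses} for _ in xs]
feature, c, z = best_stump(xs, ys, senses, uniform, ["the", "money"])
```

On this toy input the uninformative feature "the" yields $Z_t = 1$, while "money" separates the two examples and yields $Z_t = 0$, so "money" is selected.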



3 Applying Boosting to WSD 
3.1 Corpus 

In our experiments the boosting approach has been evaluated using the DSO
corpus, which contains 192,800 semantically annotated occurrences^2 of 121
nouns and 70 verbs. These correspond to the most frequent and ambiguous
English words. The DSO corpus was collected by Ng and colleagues and is
available from the Linguistic Data Consortium (LDC)^3.

For our first experiments, a group of 15 words (10 nouns and 5 verbs) which
frequently appear in the related WSD literature has been selected. These words
are described in the left hand-side of table 1. Since our goal is to acquire a
classifier for each word, each row represents a classification problem. The
number of classes (senses) ranges from 4 to 30, the number of training
examples from 373 to 1,500 and the number of attributes from 1,420 to 5,181.
The MFS column on the right hand-side of table 1 shows the percentage of the
most frequent sense for each word, i.e. the accuracy that a naive
"Most-Frequent-Sense" classifier would obtain.

^2 These examples are tagged with a set of labels which correspond, with some
minor changes, to the senses of WordNet 1.5.
^3 LDC e-mail address: ldc@unagi.cis.upenn.edu



The binary-valued attributes used for describing the examples correspond to
the binarization of seven features referring to a very narrow linguistic
context. Let "$w_{-2}\ w_{-1}\ w\ w_{+1}\ w_{+2}$" be the context of 5
consecutive words around the word $w$ to be disambiguated. The seven features
mentioned above are exactly those used in [19]: $w_{-2}$, $w_{-1}$, $w_{+1}$,
$w_{+2}$, $(w_{-2}, w_{-1})$, $(w_{-1}, w_{+1})$, and $(w_{+1}, w_{+2})$,
where the last three correspond to collocations of two consecutive words.
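
As an illustration of how such a context can be turned into binary attributes, the sketch below renders each of the seven features as a "name=value" string, so that an example is simply the set of strings that are active for it. The function name `context_features`, the lower-casing of words and the `<pad>` token used at sentence boundaries are assumptions of the example.

```python
def context_features(tokens, position, pad="<pad>"):
    """Return the seven context features of the word at index `position`.

    Each feature is a 'name=value' string; its presence in the returned set
    plays the role of a binary attribute set to true.
    """
    def w(offset):
        idx = position + offset
        # Positions outside the sentence are filled with a padding token.
        return tokens[idx].lower() if 0 <= idx < len(tokens) else pad

    return {
        "w-2=" + w(-2),
        "w-1=" + w(-1),
        "w+1=" + w(+1),
        "w+2=" + w(+2),
        "w-2,w-1=" + w(-2) + "_" + w(-1),
        "w-1,w+1=" + w(-1) + "_" + w(+1),
        "w+1,w+2=" + w(+1) + "_" + w(+2),
    }


sentence = "He deposited cash in the bank on Friday".split()
features = context_features(sentence, sentence.index("bank"))
```

Because every distinct word or word pair seen in a given position becomes its own attribute, a handful of features easily expands into thousands of binary attributes per word.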



3.2 Benchmark Algorithms and Experimental Methodology 

AdaBoost.MH has been compared to the following algorithms: 

Naive Bayes (NB). The naive Bayesian classifier has been used in its most
classical setting. To avoid the effect of zero counts when estimating the
conditional probabilities of the model, a very simple smoothing technique
proposed in the literature has been used.

Exemplar-based learning (EB_k). In our implementation, all examples are
stored in memory and the classification of a new example is based on a k-NN
algorithm using Hamming distance to measure closeness (in doing so, all
examples are examined). If k is greater than 1, the resulting sense is the
weighted majority sense of the k nearest neighbours (each example votes for
its sense with a strength proportional to its closeness to the test example).
Ties are resolved in favour of the most frequent sense among all those tied.
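
The voting scheme just described can be sketched as follows. The text does not spell out how closeness is computed, so the sketch assumes it is the fraction of attributes on which two examples agree; the names `hamming` and `knn_sense` and the toy data are likewise illustrative only.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    # With binary attributes stored as sets of active features, the Hamming
    # distance is the size of the symmetric difference.
    return len(a ^ b)


def knn_sense(train, query, k, n_attributes):
    """train: list of (feature_set, sense) pairs. Returns the predicted sense."""
    sense_freq = Counter(sense for _, sense in train)
    # All training examples are examined; the k closest ones vote.
    neighbours = sorted(train, key=lambda ex: hamming(ex[0], query))[:k]
    votes = defaultdict(float)
    for feats, sense in neighbours:
        # Assumed closeness: share of attributes on which both examples agree.
        votes[sense] += 1.0 - hamming(feats, query) / n_attributes
    top = max(votes.values())
    tied = [s for s, v in votes.items() if v == top]
    # Ties are broken in favour of the most frequent sense in the training set.
    return max(tied, key=lambda s: sense_freq[s])


train = [({"a", "b"}, "s1"), ({"a"}, "s1"), ({"c"}, "s2"),
         ({"c", "d"}, "s2"), ({"a", "c"}, "s1")]
tie_train = [({"b"}, "s2"), ({"a"}, "s1"), ({"z"}, "s1")]
```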

The comparison of algorithms has been performed in a series of controlled
experiments using exactly the same training and test sets for each method.
The experimental methodology consisted of a 10-fold cross-validation. All
accuracy/error rate figures appearing in the paper are averaged over the
results of the 10 folds. The statistical tests of significance have been
performed using a 10-fold cross-validation paired Student's t-test with a
confidence value of $t_{9, 0.975} = 2.262$.
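
For concreteness, the paired test over the ten folds can be sketched as below: the statistic is the mean of the per-fold accuracy differences divided by its standard error, and two methods are declared different when its absolute value exceeds 2.262. The function names and the per-fold accuracies are invented for the illustration and are not results from the paper.

```python
import math

T_CRITICAL_9_DF = 2.262  # t_{9, 0.975}, the threshold quoted in the text


def paired_t_statistic(acc_a, acc_b):
    """Paired Student's t statistic over per-fold accuracies of two methods."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var == 0.0:
        # Identical differences on every fold: no spread to divide by.
        return 0.0 if mean == 0.0 else math.copysign(math.inf, mean)
    return mean / math.sqrt(var / n)


def significantly_different(acc_a, acc_b, critical=T_CRITICAL_9_DF):
    return abs(paired_t_statistic(acc_a, acc_b)) > critical


# Made-up accuracies (%) on 10 folds for two hypothetical methods.
method_a = [70, 71, 69, 72, 70, 71, 70, 69, 71, 70]
method_b = [67, 68, 67, 69, 68, 68, 67, 66, 69, 67]
t_value = paired_t_statistic(method_a, method_b)
```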



3.3 Results 

Figure 2 shows the error rate curve of AdaBoost.MH, averaged over the 15
reference words, for an increasing number of weak rules per word. This plot
shows that the error obtained by AdaBoost.MH is lower than those obtained by
NB and EB_15 (k=15 is the best choice for that parameter from a number of
tests between k=1 and k=30) for a number of rules above 100. It also shows
that the error rate decreases slightly and monotonically as it approaches the
maximum number of rules reported^4.

According to the plot in figure 2, no overfitting is observed while increasing
the number of rules per word. Although it seems that the best strategy could
be "learn as many rules as possible", it has been shown elsewhere that the
number of rounds must be determined individually for each word since they have
different behaviours. The adjustment of the number of rounds can be done by
cross-validation on the training set, as has been suggested previously.
However, in our case, this cross-validation inside the cross-validation of
the general experiment would generate a prohibitive overhead. Instead, a very
simple stopping criterion (sc) has been used, which consists in stopping the
acquisition of weak rules whenever the error rate on the training set falls
below 5%, with an upper bound of 750 rules. This variant, which is referred
to as AB_sc, obtained results comparable to AB_750 while generating only
370.2 weak rules per word on average, which represents a very moderate
storage requirement for the combined classifiers.

^4 The maximum number of rounds considered is 750, merely for efficiency reasons.

Fig. 2. Error rate of AdaBoost.MH related to the number of weak rules

The numerical information corresponding to this experiment is included in
table 1. This table shows the accuracy results, detailed for each word, of NB,
EB_1, EB_15, AB_750, and AB_sc. The best result for each word is printed in
boldface.

As can be seen, in 14 out of 15 cases the best results correspond to the
boosting algorithms. When comparing global results, the accuracies of both
AB_750 and AB_sc are significantly greater than those of any of the other
methods. Finally, note that the accuracies of NB and EB_15 are comparable (as
has been suggested previously), and that the use of values of k greater than
1 is crucial for making Exemplar-based learning competitive on WSD.

4 Making Boosting Practical for WSD 

Up to now, it has been seen that AdaBoost.MH is a simple and competitive
algorithm for the WSD task. It achieves an accuracy superior to that of the
Naive Bayes and Exemplar-based algorithms tested in this paper. However,
AdaBoost.MH has the drawback of its computational cost, which prevents the
algorithm from scaling properly to real WSD domains of thousands of words.

The space and time-per-round requirements of AdaBoost.MH are $O(mk)$
(recall that $m$ is the number of training examples and $k$ the number of
senses), not including the call to the weak learner. This cost is unavoidable
since AdaBoost.MH is inherently sequential. That is, in order to learn the
$(t+1)$-th weak rule it needs the calculation of the $t$-th weak rule, which
properly updates the matrix $D_t$. Further, inside the WeakLearner, there is
another iterative process that examines, one by one, all attributes so as to
decide which is the one that minimizes $Z_t$. Since there are thousands of
attributes, this is also a time-consuming part, which can be straightforwardly
sped up either by reducing the number of attributes or by relaxing the need
to examine all attributes at each iteration.

Table 1. Set of 15 reference words and results of the main algorithms

4.1 Accelerating the WeakLearner 

Four methods have been tested in order to reduce the cost of searching for weak 
rules. The first three, consisting in aggressively reducing the feature space, are 
frequently applied in Text Categorization. The fourth consists in reducing the 
number of attributes that are examined at each round of the boosting algorithm. 

Frequency filtering (Freq): This method consists in simply discarding those
features corresponding to events that occur fewer than N times in the training
corpus. The idea behind that criterion is that frequent events are more
informative than rare ones.

Local frequency filtering (LFreq): This method works similarly to Freq but 
considers the frequency of events locally, at the sense level. More particularly, it 
selects the N most frequent features of each sense. 
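
The two frequency-based filters can be sketched as below, assuming each training example is a pair (set of active features, sense). The function names are illustrative, and the way ties between equally frequent features are broken in the local variant is left to `Counter.most_common`, which is an assumption of the example rather than something the text specifies.

```python
from collections import Counter, defaultdict


def freq_filter(examples, n):
    """Freq: keep the features that occur at least n times in the corpus."""
    counts = Counter(f for feats, _ in examples for f in feats)
    return {f for f, c in counts.items() if c >= n}


def local_freq_filter(examples, n):
    """LFreq: keep the n most frequent features of each sense."""
    per_sense = defaultdict(Counter)
    for feats, sense in examples:
        per_sense[sense].update(feats)  # one count per feature per example
    kept = set()
    for counts in per_sense.values():
        kept.update(f for f, _ in counts.most_common(n))
    return kept


corpus = [({"the", "money"}, "fin"), ({"the", "loan", "money"}, "fin"),
          ({"the", "water"}, "riv"), ({"the", "water", "boat"}, "riv")]
global_kept = freq_filter(corpus, 2)
local_kept = local_freq_filter(corpus, 2)
```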

RLM ranking: This third method consists in making a ranking of all attributes
according to the RLM distance measure and selecting the N most relevant
features. This measure has been commonly used for attribute selection in
decision tree induction algorithms^5.

LazyBoosting: The last method does not filter out any attribute but reduces
the number of those that are examined at each iteration of the boosting
algorithm. More specifically, a small proportion p of attributes are randomly
selected and the best weak rule is selected among them. The idea behind this
method is that if the proportion p is not too small, a sufficiently good rule
can probably be found at each iteration. Besides, the chance for a good rule
to appear in the whole learning process is very high. Another important
characteristic is that no attribute needs to be discarded, and so we avoid the
risk of eliminating relevant attributes^6.
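
A minimal sketch of the per-round sampling is shown below, together with a back-of-the-envelope check of the claim that a good attribute is very likely to be examined at some point. The function names are hypothetical, and the second function assumes that each round draws its subset independently, so that a given attribute is examined in a round with probability p.

```python
import random


def sample_attributes(attributes, proportion, rng):
    """Draw, without replacement, the attributes examined in one round."""
    size = max(1, round(proportion * len(attributes)))
    return rng.sample(attributes, size)


def probability_never_examined(proportion, rounds):
    # Chance that one given attribute is left out of every round's sample,
    # assuming independent draws with inclusion probability `proportion`.
    return (1.0 - proportion) ** rounds


attributes = ["attr%d" % i for i in range(100)]
subset = sample_attributes(attributes, 0.10, random.Random(0))
missed = probability_never_examined(0.10, 250)
```

Under these assumptions, with p = 10% and 250 rounds the probability that a particular attribute is never examined is below $10^{-11}$, so in practice every attribute gets many chances to contribute a rule.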

The four methods above have been compared for the set of 15 reference words.
Figure 3 contains the average error-rate curves obtained by the four variants
at increasing levels of attribute reduction. The top horizontal line
corresponds to the MFS error rate, while the bottom horizontal line stands for
the error rate of AdaBoost.MH working with all attributes. The results
contained in figure 3 are calculated running the boosting algorithm for 250
rounds for each word.

Fig. 3. Error rate obtained by the four methods, at 250 weak rules per word,
with respect to the percentage of rejected attributes



The main conclusions that can be drawn are the following: 



^5 RLM distance belongs to the distance-based and information-based families
of attribute selection functions. It has been selected because it showed
better performance than seven other alternatives in an experiment of decision
tree induction for PoS tagging [14].

^6 This method will be called LazyBoosting in reference to the work by Samuel
and colleagues. They applied the same technique for accelerating the learning
algorithm in a Dialogue Act tagging system.



• All methods seem to work quite well since no important degradation is ob- 
served in performance for values lower than 95% in rejected attributes. This 
may indicate that there are many irrelevant or highly dependent attributes 
in our domain. 

• LFreq is slightly better than Freq, indicating a preference to make frequency 
counts for each sense rather than globally. 

• The more informed RLM ranking performs better than frequency-based re- 
duction methods Freq and LFreq. 

• LazyBoosting is better than all other methods, confirming our expectations:
it is worth keeping all information provided by the features. In this case,
acceptable performance is obtained even if only 1% of the attributes is
explored when looking for a weak rule. The value of 10%, for which
LazyBoosting still achieves the same performance and runs about 7 times
faster than AdaBoost.MH working with all attributes, will be selected for the
experiments in section 5.



5 Evaluating LazyBoosting 

The LazyBoosting algorithm has been tested on the full semantically annotated
corpus with p = 10% and the same stopping criterion described in section 3.3.
This configuration will be referred to as AB_l10sc. The average number of
senses is 7.2 for nouns, 12.6 for verbs, and 9.2 overall. The average number
of training examples is 933.9 for nouns, 938.7 for verbs, and 935.6 overall.

The AB_l10sc algorithm learned an average of 381.1 rules per word, and took
about 4 days of CPU time to complete^7. It has to be noted that this time
includes the cross-validation overhead. Eliminating it, it is estimated that
4 CPU days would be the time necessary for acquiring a boosting-based word
sense disambiguation system covering about 2,000 words.

AB_l10sc has been compared again to the benchmark algorithms using the
10-fold cross-validation methodology described in section 3.2. The average
accuracy results are reported in the left hand-side of table 2. The best
figures correspond to the LazyBoosting algorithm AB_l10sc, and again, the
differences are statistically significant using the 10-fold cross-validation
paired t-test.



               Accuracy (%)                     Wins-Ties-Losses
               MFS    NB     EB_15  AB_l10sc    AB_l10sc vs. NB    AB_l10sc vs. EB_15
Nouns (121)    56.4   68.7   68.0   70.8        99(51)-1-21(3)     100(68)-5-16(1)
Verbs (70)     46.7   64.8   64.9   67.5        63(35)-1-6(2)      64(39)-2-4(0)
Average (191)  52.3   67.1   66.7   69.5        162(86)-2-27(5)    164(107)-7-20(1)

Table 2. Results of LazyBoosting and the benchmark methods on the 191-word
corpus

^7 The current implementation is written in PERL-5.003 and it was run on a SUN
UltraSparc2 machine with 194Mb of RAM.



The right hand-side of the table shows the comparison of AB_l10sc versus the
NB and EB_15 algorithms, respectively. Each cell contains the number of wins,
ties, and losses of the competing algorithms. The counts of statistically
significant differences are included in brackets. It is important to point
out that EB_15 significantly beats AB_l10sc in only one case, while NB does
so in five cases. Conversely, a significant superiority of AB_l10sc over
EB_15 and NB is observed in 107 and 86 cases, respectively.

6 Conclusions and Future Work 

In the present work, Schapire and Singer's AdaBoost.MH algorithm has been
evaluated on the word sense disambiguation task, which is one of the hardest
open problems in Natural Language Processing. As has been shown, the boosting
approach outperforms Naive Bayes and Exemplar-based learning, which represent
state-of-the-art accuracy on supervised WSD. In addition, a faster variant,
called LazyBoosting, has been suggested and tested. This variant allows
the scaling of the algorithm to broad-coverage real WSD domains, and is as
accurate as AdaBoost.MH. Further details can be found in an extended version
of this paper.

Future work is planned in the following directions:

• Extensively evaluate AdaBoost.MH on the WSD task. This would include
taking into account additional attributes, and testing the algorithms on other
manually annotated corpora, and especially on sense-tagged corpora
automatically obtained from the Internet.

• Confirm the validity of the LazyBoosting approach on other language learning
tasks in which AdaBoost.MH works well, e.g. Text Categorization.

• It is known that mislabelled examples resulting from annotation errors tend
to be hard examples to classify correctly, and, therefore, tend to have large
weights in the final distribution. This observation makes it possible both to
identify the noisy examples and to use boosting as a way to improve data
quality. It is suspected that the corpus used in the current work is very
noisy, so it could be worth using boosting to try and improve it.